This project was created for Doctor Ken Steif’s MUSA 507 course in Fall 2018.

Report Content

  1. Introduction
  2. Data
  3. Method of Variable Selection and The Finalised Model
  4. The Final Model: Regression Results and Discussion
  5. Cross-validation Checks and Discussion
  6. Extra Credit: Spatial Cross Validation
  7. Concluding Remarks and Reflections on Class Discussion

1. Introduction

Nashville is among the fastest growing real estate markets in the country, with rising home prices and an increasing influx of people from out of state. Numerous factors make the city attractive to would-be residents, including job opportunities, quality of life, amenities and redevelopment. With growing interest in buying homes in Nashville, prospective homeowners need accurate predictions of home values to make sure they end up with positive equity on their investments.

As planners, it is important to study the changing patterns in the housing market and accurately identify where these changes occur in order to better plan for inclusive, equitable and thriving cities. By anticipating where price changes are likely to occur, planners can prepare stronger policies that support the growing market while also considering people’s ability to afford living in that area. Stricter policies may limit how far homeowners can raise their sale prices, so that property ownership in the region is not restricted to the wealthy; where sale prices are low, planners can look for development opportunities to make the region more attractive for new homeowners to invest in. With population growth projected for Nashville in the upcoming years [1], it is important for planners to be able to accommodate the growing number of people with affordable home prices.

The purpose of this project is to build a predictive model of home prices in Nashville using machine learning algorithms. Data from open sources such as Zillow, Nashville Open Data, and the US Census Bureau were aggregated and analyzed using R.

A five-step process was used to create the final prediction model for home sale prices:

  1. Data Collection
  2. Data Processing
  3. Model Building
  4. Evaluation of Models
  5. Final Model Selection

Underlying this process is a conscious consideration of the Modifiable Areal Unit Problem (MAUP). Therefore, where possible, our data processing preferred point pattern analysis over polygon based analysis. In other words, spatial factors were based on the distance from each observation point rather than being aggregated by zone.
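
As a minimal sketch of the point-based approach described above, the distance from each property to its nearest feature can be computed directly from coordinates instead of aggregating observations by zone. The coordinates and the column name dist_nearest_school are hypothetical and assume a projected coordinate system in metres; this is an illustration, not the project's actual data or code.

```r
# Hypothetical projected coordinates in metres (not the project's data)
homes   <- data.frame(x = c(0, 1000, 2500), y = c(0, 500, 2000))
schools <- data.frame(x = c(200, 2400),     y = c(100, 1900))

# Euclidean distance from every home (rows) to every school (columns)
dist_matrix <- sqrt(outer(homes$x, schools$x, "-")^2 +
                    outer(homes$y, schools$y, "-")^2)

# Point-based spatial feature: distance to the nearest school for each home
homes$dist_nearest_school <- apply(dist_matrix, 1, min)
```

Because every observation gets its own distance value, the feature does not depend on how zone boundaries happen to be drawn.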

Several limitations made this exercise difficult to design. Our limited domain knowledge of the area made it challenging to determine which factors beyond physical house properties most strongly influenced home prices. With a narrow understanding of how certain predictors are valued in relation to house prices, it was difficult to engineer variables whose distributions adequately approximate reality. Another difficulty was the tradeoff between accuracy and generalizability - while fitting our preliminary models, there were numerous instances of models predicting observed prices accurately in the training set, but not in the test set. Finally, the greatest challenge was to simplify the complexity of the housing market into a simple OLS Regression Model. Our understanding was that the relationships underlying house prices are rarely simple and linear - translating such relationships within the bounds of OLS assumptions proved to be a tricky process.

In brief, the final model showed that the most significant predictors of house prices were a combination of physical property and geodemographic characteristics. The final model predicted accurately on observed values but inaccurately on unobserved values. In conclusion, although the model had significant predictors of house prices, it was not generalizable.

2. Data

2.1 Data Collection

The initial dataset provided was limited to home owner information and physical house characteristics. Understanding that house prices also depend on external spatial factors, we sought to expand the range of predictor variables to adequately represent the effect of space on house prices. To gather additional data, we explored open data platforms such as Nashville Open Data and the US Census Bureau. From these, additional datasets containing information on neighborhood characteristics, household composition, school locations, and police incident records were scraped and processed.

The table below presents the complete list of 36 predictors considered for our final model. Ultimately, only 10 predictors were fitted, as highlighted in purple (for property characteristics) and in red (for spatial characteristics).

Predictors

Property Predictors
  • Location Zip
  • Location City
  • Building type
  • Story height
  • Exterior wall material
  • Frame type
  • Central Air System
  • Heating type
  • Foundation type
  • Year built
  • Effective year
  • Physical Depreciation
  • Acreage
  • Number of rooms
  • Number of bedrooms
  • Number of other rooms
  • Number of bathrooms
  • Number of half baths
  • Total number of bathrooms

Spatial Predictors
  • Local Owner
  • Distance to public school
  • Distance to good school
  • Distance to Vanderbilt University
  • Distance to Business Improvement District (BID)
  • Distance to airport
  • Police incidents
  • Commercial development within mile-vicinity
  • Residential development within mile-vicinity
  • Distance to bus stops
  • North or South of river
  • Road zones
  • Median income
  • Percentage Black Community
  • Number of Establishments within mile-vicinity
  • Pay per Employee
  • Employment

2.2 Data Processing

A. Feature Engineering Predictive Variables
Predictive Property Characteristics

Upon exploring the given dataset, it was concluded that several of the column characteristics could be aggregated into simpler predictor columns. For example, the Local Owner predictor was derived from the raw dataset by comparing the columns showing Owner City and Location City.

A similar process of adding and subtracting values was used to further simplify internal house characteristics such as Number of Bedrooms, Number of Bathrooms, and Number of Other Rooms.
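
The Local Owner comparison described above could be sketched roughly as follows. The column names OwnerCity and LocationCity and the sample values are assumptions for illustration; the raw dataset's actual column names and cleaning steps are not shown in this report.

```r
# Hypothetical sample of the raw sales data (column names are assumed)
sales <- data.frame(
  OwnerCity    = c("NASHVILLE", "Franklin", "nashville "),
  LocationCity = c("NASHVILLE", "NASHVILLE", "NASHVILLE"),
  stringsAsFactors = FALSE
)

# Normalise case and surrounding whitespace so equivalent spellings match
normalise <- function(x) toupper(trimws(x))

# An owner is "local" when the owner's city matches the property's city
sales$LocalOwned <- ifelse(
  normalise(sales$OwnerCity) == normalise(sales$LocationCity),
  "YES", "NO"
)
```

Normalising the text before comparing is a design choice made here to avoid treating differently formatted spellings of the same city as different cities.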

Predictive Geodemographic Characteristics

Despite the limitations of our domain knowledge about Nashville, through research we included more predictor variables that were thought to be influential factors in determining house prices. The following summarizes the process of how geodemographic variables in the final model were selected and engineered to appropriately fit the model.

  • Median Income: Each property point is associated with the median income of the block group it resides within. Median income is often thought to be a significant indicator in determining house prices, as the house price to income ratios reflect a measure of affordability. As such, higher income households are more likely to be able to afford higher priced homes while lower income households are less likely to buy more expensive homes.

  • Percent of Black Communities: Each property point is associated with the percentage of Black residents in the block group it resides within. Gentrification is a well-known phenomenon in city neighborhoods and certainly an issue to be considered. With the understanding that Nashville’s population is predominantly white, we sought to represent minority groups and explore whether their presence may be an influential factor in driving home sale prices. Whereas 78% of the total population is white, only 15% is Black or African American, with the rest of the population composed of other minority groups such as Asian and Hispanic [2]. Due to limited data on the smaller minority populations, the percentage of Black residents by block group was used for the analysis.

  • Pay Per Employee: This predictor was created by dividing the total pay in each census block group by the total number of employees in the same block group. It characterizes employment types by distinguishing between high paying and low paying jobs, on the assumption that white collar jobs yield a higher pay per employee than blue collar jobs do. Based on how these values were distributed across the region, it was predicted that higher pay per employee would be correlated with higher house prices, as higher-paid people would be willing to pay more to live close to their workplaces, whereas those who are paid less would not value such proximity as much.

  • Zones by Road: By overlaying the Sale Price maps onto a satellite map of the region, it was possible to see another spatial pattern based on the major roads going through Davidson County. Noting that this could be another spatial predictor influencing home prices, the county was classified into road zones based on visual observation. Using ESRI’s ArcMap, Sale Price points were overlaid on the county and roads layers. The zones were manually drawn out using the roads as boundaries.

  • Residential Development within mile-vicinity: The number of residential permit applications approved within the mile-vicinity of the property is calculated and used as a proxy for the extent of residential development. Increasing the supply of housing in an area possibly indicates increased popularity for that area. Thus, increased competition to own a house in that area ultimately drives house values up. Residential developments were therefore considered to be a significant factor influencing the prices of homes.
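
A count-within-radius feature like Residential Development within mile-vicinity can be sketched as below. The coordinates are hypothetical and assume a projected coordinate system in metres; the column name BuildingResiCount mirrors the variable name in the regression output later in this report, but the code is an illustration and not the project's own processing script.

```r
# One mile expressed in metres
mile_m <- 1609.344

# Hypothetical projected coordinates (not the project's data)
homes   <- data.frame(x = c(0, 5000), y = c(0, 0))
permits <- data.frame(x = c(100, 1500, 1700, 5200), y = c(0, 0, 0, 300))

# Distance from every home (rows) to every approved permit (columns)
d <- sqrt(outer(homes$x, permits$x, "-")^2 +
          outer(homes$y, permits$y, "-")^2)

# Number of approved residential permits within one mile of each home
homes$BuildingResiCount <- rowSums(d <= mile_m)
```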

For more explanation on how the other predictor variables (not used in the final model) were selected, please refer to the appendix below.

B. Predicting for NA’s (Imputation)

As with most real-world data, our data included null values or 0 values that stood in for NAs. Due to the large number of observations with missing information, we attempted to predict these values through multiple hot-deck imputation, under the assumption that the values were missing at random. This method imputes a missing value from observations that are similar in terms of other variables - the assumption being that such observations will likely have a similar value for the missing variable.
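
To illustrate the idea, here is a minimal sketch of a single random hot-deck draw within donor classes, written in base R. The data, the choice of number of bedrooms as the matching variable, and the function name hot_deck are all assumptions for illustration; multiple imputation would repeat such draws several times and pool the results, and the report does not show which implementation was actually used.

```r
set.seed(507)

# Hypothetical data: acreage is missing for two houses
houses <- data.frame(
  bedrooms = c(2, 2, 2, 3, 3, 3),
  acreage  = c(0.15, NA, 0.18, 0.30, 0.28, NA)
)

# Fill each missing value with a randomly drawn observed value (a "donor")
# from the same class, i.e. from houses that are similar on another variable
hot_deck <- function(values, class) {
  for (cl in unique(class)) {
    in_class <- class == cl
    donors   <- values[in_class & !is.na(values)]
    missing  <- in_class & is.na(values)
    if (length(donors) > 0 && any(missing)) {
      picks <- sample.int(length(donors), sum(missing), replace = TRUE)
      values[missing] <- donors[picks]
    }
  }
  values
}

houses$acreage_imputed <- hot_deck(houses$acreage, houses$bedrooms)
```

Keeping the imputed values in a separate column makes it easy to compare models fitted with and without imputation.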

The advised threshold for imputing missing values is 5% for large datasets [3]. While most of our variables had missingness slightly above this threshold (around 6-7%), only one variable (Acreage) fell far beyond it at 39%. It was hoped that the imputations for the other predictors would yield better results. We created two otherwise identical datasets - Dataset X without imputed values and Dataset Y with imputed values. Both were tested when designing the final prediction model.

However, it was noted that missingness in this dataset was not at random. Instead, observations with systematic similarities showed a similar extent of missingness. This means that imputation might not improve fit, and could worsen bias instead. For these reasons, we ultimately decided not to present predictions based on the imputed dataset.

3. Method of Variable Selection and The Finalised Model

To select variables from our initial set of 36, we first ran a Kitchen Sink regression model incorporating all 36 predictors. Using this as a base model, the regression summary allowed us to quickly eliminate insignificant factors and work with those that were significant predictors of house prices. This was done by looking at the p-values. Predictors with a p-value of more than 0.05 were considered insignificant predictors of house sale prices - for these, we cannot rule out that the estimated relationship arose by chance and would not be observed in other samples of house sales.
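
The screening step described above can be sketched on simulated data as follows. The variables acreage, baths and noise, and the simulated prices, are invented for illustration; only lm() and summary() from base R are used, and this is not the project's actual 36-predictor model.

```r
set.seed(1)
n <- 200

# Simulated data: price depends on acreage and baths, but not on noise
toy <- data.frame(
  acreage = runif(n, 0.1, 1),
  baths   = sample(1:4, n, replace = TRUE),
  noise   = rnorm(n)
)
toy$price <- 100000 + 50000 * toy$acreage + 40000 * toy$baths +
  rnorm(n, sd = 20000)

# "Kitchen sink" model: regress price on every other column
kitchen_sink <- lm(price ~ ., data = toy)

# Extract p-values from the coefficient table (dropping the intercept)
coef_table <- summary(kitchen_sink)$coefficients
p_values   <- coef_table[-1, "Pr(>|t|)"]

# Keep predictors that are significant at the 0.05 level
keep <- names(p_values)[p_values < 0.05]
```

With categorical predictors, each category gets its own row in the coefficient table, so screening would need to be done per variable and not per row.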

When testing different combinations of predictors, models were compared in their ability to explain the observed variation in house sale prices using R^2 and RMSE (Root Mean Square Error) diagnostics. Firstly, the R^2 value is a ‘goodness of fit’ measure for the linear regression model - it represents the proportion of the variation in price explained by the predictors. Secondly, the RMSE indicates how far, on average, the model’s predicted sale prices fall from the observed prices. A large RMSE indicates that the predicted values deviate substantially from reality. Our choice of a final model for further cross-validation tests was therefore guided by the ideal of a high R^2 value and a low RMSE.
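
For reference, both diagnostics can be computed directly from observed and predicted prices. The four price pairs below are made-up numbers used only to show the arithmetic.

```r
# Made-up observed and predicted sale prices
observed  <- c(200000, 250000, 300000, 400000)
predicted <- c(210000, 240000, 320000, 370000)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((observed - predicted)^2))

# R^2: one minus the ratio of unexplained variation to total variation
ss_residual <- sum((observed - predicted)^2)
ss_total    <- sum((observed - mean(observed))^2)
r2 <- 1 - ss_residual / ss_total
```

RMSE is expressed in the same units as the sale price, which makes it easier to interpret than R^2 when judging how costly a typical prediction error would be.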

Our final model sought to predict House Sale Price as a function of:

  1. Size of Property (in acres)
  2. Number of Bedrooms
  3. Number of Bathrooms
  4. Number of Other rooms
  5. Size of Black Community
  6. Median Income of Area
  7. Whether it is locally-owned
  8. Annual Pay Per Employee in Area
  9. Road Zone
  10. Residential Development in Area

The following figures present the summary and exploratory statistics of the final fitted variables.

Summary Statistics: The Average Property in Nashville

The table below presents the central tendency of variables fitted in the model - in other words, it reflects the average house characteristic and spatial situation in Nashville.

It can be observed that while the average house price in Nashville, at $290258, is not exceedingly high, the large standard deviation indicates a large disparity in house prices. There is a wide spectrum of house prices in this city, and the average house price cannot be used to generalise that Nashville has largely affordable housing.

The average property in Nashville is 0.23 acres in size - again, the large standard deviation indicates that this statistic is not generalisable across Nashville. This average property is likely to have 3 bedrooms, 2 or 3 bathrooms, and 3 other rooms that are neither bedrooms nor bathrooms.

The average property in Nashville tends to reside within a block group that is 27.4% Black, with a median income of $55973, in the south east part of the city. It is very likely owned by someone residing in Nashville instead of in other cities in the rest of the United States. It is also typically surrounded by a high number of potential residential developments, with 159 issued and approved permits for future residential projects. It typically resides in an area with a relatively high economic pay-off, with employees working in its area earning around $443100 per year.

Summary of Variables

Variable                                              Central Tendency            Standard Deviation
Dependent Variable
  Sale Price                                          $290258                     333876.2
Predictive Property Characteristics
  Size of Property                                    0.23 Acres                  0.72
  Number of Bedrooms                                  Three Bedrooms (48.5%)
  Number of Bathrooms                                 2 or 3 Bathrooms (63.1%)
  Number of Other Rooms                               3 Other Rooms (34.6%)
Predictive Geodemographic Characteristics
  Size of Black Community in Block Group              27.4%                       25.7
  Median Income of Block Group                        $55973                      26957.9
  Annual Pay Per Employee in Area                     $4431000                    1070325
  Local Owner                                         Yes (84.1%)
  Road Zone                                           South East (22.9%)
  Number of Residential Development in Mile-vicinity  156.9                       149.1

The Observed Phenomenon: How do house sale prices vary across Nashville?

The interactive map below presents the distribution of house sale prices across Nashville. To better visualise the relative differences in house sale prices between properties, the logged Sale Price is also presented as a comparison layer - you can toggle between the two layers to observe this distribution for yourself!

From this interactive map, we can observe that similar house prices are often clustered spatially together. Three big spatial clusters of high house sale prices can be immediately observed north and south of the river, as well as at the southern boundary of the city.

Some interesting variables

The spatial distributions of three predictors - Residential Development in area, Number of Bedrooms, Median Income of area - fitted in the model are presented likewise in interactive maps below. To aid your own visual exploration, we added the Sales Price layer for you to toggle between each predictor of interest and sales price!

From these maps, it can be observed that the Predictive Property Characteristic (Number of Bedrooms) seems to be distributed randomly across Nashville. On the other hand, the two Predictive Geodemographic Characteristics display clear spatial variations. This suggests that such variables play an important role in driving the spatially-clustered patterns of house sale prices we observed in the previous map.

4. The Final Model: Regression Results and Discussion

The table below presents the regression model results. The coefficient estimate indicates how an increase in each continuous variable, or how each category in a categorical variable, is estimated to change house price. For instance, an acre-increase in the size of property is estimated to increase house price by $18000, while a one-bedroom property is estimated to be $212000 cheaper than a property with more than 4 bedrooms.

Most of the variables are significant predictors of house prices. However, certain categories of the categorical Property Characteristics (most levels of Number of Other Rooms, for example) have no significant effect on house prices.

Training Set Model Results

Variable                    Coefficient Estimate  Standard Error  T-Statistic  Associated P-value
Constant Price
  (Intercept)               222368.54             100428.22       2.21         0.03
Predictive Property Characteristics
  Acrage.x                  18454.67              7394.32         2.50         0.01
  NumBedRoomsNO BEDROOM     92072.57              98565.31        0.93         0.35
  NumBedRoomsONE BEDROOM    -212677.33            23580.49        -9.02        0.00
  NumBedRoomsTHREE BEDROOM  -55421.98             10710.10        -5.17        0.00
  NumBedRoomsTWO BEDROOM    -122863.69            12680.23        -9.69        0.00
  NumBath2 or 3 Bath        42742.17              9457.45         4.52         0.00
  NumBath4 Bath             285088.80             23329.49        12.22        0.00
  factor(otherrooms.x)1     17744.76              97678.35        0.18         0.86
  factor(otherrooms.x)2     -80221.94             97078.21        -0.83        0.41
  factor(otherrooms.x)3     -80822.59             97234.14        -0.83        0.41
  factor(otherrooms.x)4     -54339.44             97429.72        -0.56        0.58
  factor(otherrooms.x)5     -8059.46              98197.95        -0.08        0.93
  factor(otherrooms.x)6     24136.31              100126.03       0.24         0.81
  factor(otherrooms.x)7     96037.12              106400.41       0.90         0.37
  factor(otherrooms.x)8     84114.88              112775.04       0.75         0.46
  factor(otherrooms.x)9     269497.01             152806.68       1.76         0.08
  factor(otherrooms.x)10    245179.89             173815.71       1.41         0.16
  factor(otherrooms.x)11    1500824.44            304528.90       4.93         0.00
  factor(otherrooms.x)12    401173.78             304321.41       1.32         0.19
  factor(otherrooms.x)15    75592.51              303268.31       0.25         0.80
Predictive Geodemographic Characteristics
  PctBlack                  -511.37               181.81          -2.81        0.00
  MedInc                    0.42                  0.17            2.44         0.01
  PayPerEmp                 4051.34               548.42          7.39         0.00
  LocalOwnedYES             -23237.88             9725.78         -2.39        0.02
  RoadZoneEAST              -146698.21            23241.61        -6.31        0.00
  RoadZonemiddle            64598.10              24740.33        2.61         0.01
  RoadZonenorth             -117438.66            21293.32        -5.52        0.00
  RoadZoneriverNorth        -103501.53            18739.18        -5.52        0.00
  RoadZoneriverSouth        -87972.55             16941.85        -5.19        0.00
  RoadZonesEast             -104714.03            14717.68        -7.11        0.00
  RoadZonesouth             94123.48              20179.89        4.66         0.00
  RoadZonesRight            -47714.33             19357.90        -2.46        0.01
  RoadZonesWest             79444.37              20074.74        3.96         0.00
  BuildingResiCount         225.39                31.87           7.07         0.00
Note: Overall Model Diagnostics
  1. If the associated P-value in the table is more than 0.05, the variable (or category) is not a statistically significant predictor.
  2. Number of observations in training set: 9001
  3. Residual standard error: 282700 on 8966 degrees of freedom
  4. Adjusted R-squared: 0.296 - this model explains 29.6% of the variability in house prices observed in Nashville.
  5. The P-value associated with the F-test is approximately zero - this means it is very unlikely that the predictors are jointly unrelated to house price.

The plot below shows how the values fitted by the model compare with the actual house prices. A perfectly accurate model would have resulted in all the points lying on the straight line in the plot. However, as this plot clearly shows, our model was not such a model. Instead, there seems to be a pattern in the extent of inaccuracy - higher house prices tend to be under-predicted, and this discrepancy grows as house prices increase. In other words, our model is poor at accurately predicting high house prices.

5. Cross Validation Checks and Discussion

After fitting the final model, we wanted to find out how useful it is for predicting unseen house prices in Nashville. A model that can accurately predict unseen values is described as a generalisable one. To evaluate this, we cross-validated the final model on ‘unseen’ data randomly selected from the Train dataset. This is a useful way to avoid overfitting the model to the specific Train dataset we were working with. In simpler terms, this method partitions our Train data into sub-train and sub-test datasets, trains the model on the sub-train data and tests it on the sub-test data, repeating the process a specified number of times (k). For our model, we cross-validated on 100 randomly partitioned sets of house sale price observations.
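
The procedure described above can be sketched in base R as repeated random hold-out validation. The simulated data, the 25% hold-out share and the two-predictor model are assumptions for illustration; the report does not show the exact resampling settings or functions that were used.

```r
set.seed(2018)
n <- 300

# Simulated stand-in for the Train dataset
toy <- data.frame(
  acreage = runif(n, 0.1, 1),
  baths   = sample(1:4, n, replace = TRUE)
)
toy$price <- 100000 + 50000 * toy$acreage + 40000 * toy$baths +
  rnorm(n, sd = 30000)

n_iter <- 100
mae <- numeric(n_iter)
r2  <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  # Randomly hold out 25% of rows as the sub-test set
  test_rows <- sample.int(n, size = round(0.25 * n))

  # Train on the remaining rows, then predict the held-out rows
  fit  <- lm(price ~ acreage + baths, data = toy[-test_rows, ])
  pred <- predict(fit, newdata = toy[test_rows, ])
  obs  <- toy$price[test_rows]

  mae[i] <- mean(abs(pred - obs))
  r2[i]  <- cor(pred, obs)^2  # squared correlation of predicted vs observed
}

# Mean and spread of the diagnostics across the 100 iterations
cv_summary <- c(MAEmean = mean(mae), MAEsd = sd(mae),
                R2mean = mean(r2), R2sd = sd(r2))
```

A small standard deviation across iterations would indicate that the model performs consistently regardless of which observations it is trained on.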

From this method, the diagnostic values of Mean Absolute Error and Mean Absolute Percentage Error indicate the average extent of predictive inaccuracy across the 100 iterations of model-fitting. High values for these two diagnostics point towards poor generalisability. In other words, such a model will not be effective in accurately predicting unobserved house prices in Nashville should it be applied to study this phenomenon in future.

Unfortunately, the table below indicates that ours is such a model. There is a large variance in the MAE and R^2 across the different folds of the cross-validation process. This suggests that the model is inconsistent in its accuracy - it is accurate sometimes, but when it is inaccurate, the inaccuracy is severe.

Error Diagnostics for Cross-Validated Training Set

Diagnostic                  Name            Value
Standard deviation of MAE   ValidMAEsd      30277.834
Mean of MAE                 ValidMAEmean    138298.412
Standard deviation of R2    ValidR2sd       0.149
Mean of R2                  ValidR2mean     0.388

The map below allows you to compare the distributions of observed and predicted house prices.

Further in-sample cross validation checks

A further 25% of the observations in the training set were set aside to create an in-sample test set for additional cross-validation. This test set comprises 2250 observations.

From the table below, the error diagnostics also indicate poor fit and generalisability. What is interesting to note is the high standard deviation of the Absolute Percentage Error (APE) in this test set. This suggests, again, that the model is inconsistently accurate.

Diagnostics for in-sample test set

Diagnostic                              Value
R2                                      0.48
Mean Absolute Error (MAE)               134222.04
Mean Absolute Percentage Error (MAPE)   0.99
Standard Deviation of APE               6.29
Minimum APE                             0.00
Maximum APE                             256.40

The plot below illustrates this point. The Absolute Percentage Error is low for most of the observations in the test set. The high overall MAPE is the result of some severe inaccuracies when predicting certain observations.

Are the error residuals spatially clustered?

From the Moran’s I test results (below), there is a low but statistically significant positive spatial autocorrelation (Moran’s I of about 0.08) in the distribution of residuals across Nashville. This suggests that our model left out some significant spatial features that account for the spatial distribution of house prices. Identifying these features could improve our model performance.

## Warning in knearneigh(df_trainTest1, k = 8): knearneigh: identical points
## found
## 
##  Moran I test under randomisation
## 
## data:  df_trainTest1$`regTest$residuals`  
## weights: nb2listw(w)    
## 
## Moran I statistic standard deviate = 8.3341, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      7.922387e-02     -4.466280e-04      9.138632e-05
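
To make the statistic reported above more concrete, the sketch below computes Moran's I by hand in base R using row-standardised weights over the 8 nearest neighbours (the warning in the output shows that k = 8 was used). The grid of points and the simulated residuals are invented for illustration; this hand-rolled calculation is not the code that produced the output above and it omits the significance test.

```r
set.seed(42)

# Hypothetical observation locations on an 8 x 8 grid
coords <- as.matrix(expand.grid(x = 1:8, y = 1:8))
n <- nrow(coords)

# Simulated residuals with a west-to-east trend, i.e. spatially clustered
resid <- coords[, "x"] + rnorm(n, sd = 0.5)

# Row-standardised spatial weights: each point's k nearest neighbours get 1/k
k <- 8
d <- as.matrix(dist(coords))
diag(d) <- Inf                      # a point is not its own neighbour
w <- matrix(0, n, n)
for (i in seq_len(n)) {
  nearest <- order(d[i, ])[seq_len(k)]
  w[i, nearest] <- 1 / k
}

# Moran's I: weighted cross-product of deviations relative to total variance
z <- resid - mean(resid)
moran_i <- (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)

# Expected value of Moran's I when there is no spatial autocorrelation
expected_i <- -1 / (n - 1)
```

A value of moran_i well above expected_i indicates that nearby residuals resemble each other, which is the pattern the test above detects in the model's errors.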

The following plots also show how errors in prediction are greater for certain parts of Nashville. This reiterates the point made above - the inaccuracy of our model seems to stem from the omission of important spatial features in our final model.

The plot below suggests that the model seems to be most inaccurate when predicting for neighbourhoods with low house sale prices.

6. Extra Credit: Spatial Cross-Validation

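# Copy median income into a helper column for defining income classes
dfx2$incomeClass<-dfx2$MedInc

# Work on a plain data frame containing only training observations (test == 0)
dfEC<-as.data.frame(dfx2)
dfEC<-subset(dfEC, dfEC$test==0)

# Split the training observations into three income groups
# (column 101 is the income column that the cut-offs are applied to).
# Note: the medium group's upper bound (65983) differs from the
# high group's cut-off (65938).
dfEC_highincome<-subset(dfEC, dfEC[,101]>65938)
dfEC_lowincome<-subset(dfEC, dfEC[,101]<=38892)
dfEC_medincome<-subset(dfEC, dfEC[,101]>38892 & dfEC[,101]<=65983)

# Leave-one-group-out training sets: each excludes one income group
dfEC_NOhigh<-rbind(dfEC_lowincome,dfEC_medincome)
dfEC_NOmed<-rbind(dfEC_highincome,dfEC_lowincome)
dfEC_NOlow<-rbind(dfEC_highincome,dfEC_medincome)

# Model formula shared by the three regressions below
fEC<-SalePrice.x ~ Acrage.x + bedroomsunits_building.x + 
  otherrooms.x + baths.x + LocalOwned + BuildingResiCount + 
  RoadZone + PctBlack + MedInc + PayPerEmp

# dplyr provides %>% and mutate(), which are used below
library(caret)
library(dplyr)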

#Predict for High income neighborhood
regression_NOhigh<-lm(fEC, dfEC_NOhigh)
summary(regression_NOhigh)
## 
## Call:
## lm(formula = fEC, data = dfEC_NOhigh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -606033 -110687  -29476   43785 4210302 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.737e+04  2.778e+04   0.985 0.324550    
## Acrage.x                  5.304e+04  5.606e+03   9.460  < 2e-16 ***
## bedroomsunits_building.x -1.765e+04  4.728e+03  -3.734 0.000190 ***
## otherrooms.x             -1.175e+04  3.302e+03  -3.558 0.000377 ***
## baths.x                   3.677e+04  5.402e+03   6.806 1.09e-11 ***
## LocalOwnedYES            -2.635e+04  8.329e+03  -3.163 0.001568 ** 
## BuildingResiCount         2.494e+02  2.767e+01   9.014  < 2e-16 ***
## RoadZoneEAST             -2.306e+05  2.111e+04 -10.924  < 2e-16 ***
## RoadZonemiddle            2.666e+04  4.123e+04   0.647 0.517837    
## RoadZonenorth            -1.369e+05  1.685e+04  -8.124 5.33e-16 ***
## RoadZoneriverNorth       -1.415e+05  1.541e+04  -9.178  < 2e-16 ***
## RoadZoneriverSouth       -1.417e+05  1.417e+04  -9.999  < 2e-16 ***
## RoadZonesEast            -1.433e+05  1.279e+04 -11.204  < 2e-16 ***
## RoadZonesouth            -2.054e+02  2.019e+04  -0.010 0.991883    
## RoadZonesRight           -1.352e+05  1.851e+04  -7.302 3.15e-13 ***
## RoadZonesWest            -1.538e+04  1.947e+04  -0.790 0.429743    
## PctBlack                 -3.810e+02  1.485e+02  -2.566 0.010299 *  
## MedInc                   -1.051e-01  2.938e-01  -0.358 0.720511    
## PayPerEmp                 8.201e+03  5.512e+02  14.878  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256600 on 6745 degrees of freedom
## Multiple R-squared:  0.1917, Adjusted R-squared:  0.1895 
## F-statistic: 88.85 on 18 and 6745 DF,  p-value: < 2.2e-16
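# Predict sale prices for the held-out high-income group.
# Note: this call uses modelFINAL, not regression_NOhigh fitted above.
regPredHigh<-predict(modelFINAL,dfEC_highincome)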

regPredVal_high<-
  data.frame(observed=dfEC_highincome$SalePrice.x,
             predicted=regPredHigh)

regPredVal_high<-
  regPredVal_high%>%
  mutate(error=predicted-observed)%>%
  mutate(absError=abs(predicted-observed))%>%
  mutate(percentAbsError=abs(predicted-observed)/observed)

head(regPredVal_high)
##   observed predicted     error  absError percentAbsError
## 1   537000  445343.9 -91656.08  91656.08       0.1706817
## 2   285000  365360.8  80360.85  80360.85       0.2819679
## 3   221000  601886.7 380886.70 380886.70       1.7234692
## 4   370000  297094.7 -72905.32  72905.32       0.1970414
## 5   376000  556341.4 180341.35 180341.35       0.4796313
## 6   250000  465061.1 215061.11 215061.11       0.8602444
#MAE
mean(regPredVal_high$absError)
## [1] 187940.4
#MAPE
mean(regPredVal_high$percentAbsError)
## [1] 0.5151959
summary(regPredVal_high$percentAbsError)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00116   0.14553   0.30563   0.51520   0.53743 113.06660
#Predict for Med income neighborhood
regression_NOmed<-lm(fEC, dfEC_NOmed)
summary(regression_NOmed)
## 
## Call:
## lm(formula = fEC, data = dfEC_NOmed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -879662 -145885  -50514   53426 6227799 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               1.432e+05  3.861e+04   3.708 0.000212 ***
## Acrage.x                  6.806e+04  9.940e+03   6.847 8.58e-12 ***
## bedroomsunits_building.x -1.418e+04  7.236e+03  -1.960 0.050080 .  
## otherrooms.x             -8.308e+03  4.556e+03  -1.824 0.068285 .  
## baths.x                   9.384e+04  7.929e+03  11.835  < 2e-16 ***
## LocalOwnedYES            -6.549e+04  1.432e+04  -4.574 4.91e-06 ***
## BuildingResiCount         1.950e+02  4.618e+01   4.222 2.47e-05 ***
## RoadZoneEAST             -1.253e+05  4.468e+04  -2.804 0.005069 ** 
## RoadZonemiddle            5.223e+04  2.977e+04   1.755 0.079397 .  
## RoadZonenorth            -1.276e+05  3.259e+04  -3.915 9.17e-05 ***
## RoadZoneriverNorth       -6.750e+04  2.605e+04  -2.591 0.009587 ** 
## RoadZoneriverSouth       -9.233e+04  2.497e+04  -3.698 0.000220 ***
## RoadZonesEast            -6.252e+04  2.592e+04  -2.412 0.015897 *  
## RoadZonesouth             1.825e+05  2.991e+04   6.101 1.14e-09 ***
## RoadZonesRight           -3.189e+04  2.981e+04  -1.070 0.284782    
## RoadZonesWest             1.845e+05  3.063e+04   6.023 1.84e-09 ***
## PctBlack                 -2.702e+02  2.482e+02  -1.088 0.276440    
## MedInc                    3.890e-01  2.116e-01   1.838 0.066090 .  
## PayPerEmp                 1.672e+03  7.413e+02   2.255 0.024168 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 339900 on 4500 degrees of freedom
## Multiple R-squared:  0.241,  Adjusted R-squared:  0.238 
## F-statistic: 79.39 on 18 and 4500 DF,  p-value: < 2.2e-16
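# Predict sale prices for the held-out medium-income group.
# Note: this call uses modelFINAL, not regression_NOmed fitted above.
regPredMed<-predict(modelFINAL,dfEC_medincome)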

regPredVal_med<-
  data.frame(observed=dfEC_medincome$SalePrice.x,
             predicted=regPredMed)

regPredVal_med<-
  regPredVal_med%>%
  mutate(error=predicted-observed)%>%
  mutate(absError=abs(predicted-observed))%>%
  mutate(percentAbsError=abs(predicted-observed)/observed)

head(regPredVal_med)
##   observed predicted       error   absError percentAbsError
## 1   192000  177705.2  -14294.801  14294.801      0.07445209
## 2    40000  121522.7   81522.679  81522.679      2.03806698
## 3   425000  289515.9 -135484.060 135484.060      0.31878602
## 4   167000  161687.7   -5312.256   5312.256      0.03180992
## 5   325000  246102.0  -78898.019  78898.019      0.24276314
## 6   150000  112570.4  -37429.627  37429.627      0.24953085
#MAE
mean(regPredVal_med$absError)
## [1] 111010.9
#MAPE
mean(regPredVal_med$percentAbsError)
## [1] 0.6758317
#Predict for Low income neighborhood
regression_NOlow<-lm(fEC, dfEC_NOlow)
summary(regression_NOlow)
## 
## Call:
## lm(formula = fEC, data = dfEC_NOlow)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -765263 -112789  -33258   38348 6396351 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -1.662e+04  2.667e+04  -0.623 0.533088    
## Acrage.x                  6.502e+04  7.328e+03   8.873  < 2e-16 ***
## bedroomsunits_building.x  5.614e+03  5.280e+03   1.063 0.287697    
## otherrooms.x              1.972e+03  3.349e+03   0.589 0.556008    
## baths.x                   5.210e+04  5.730e+03   9.093  < 2e-16 ***
## LocalOwnedYES            -3.995e+04  1.027e+04  -3.891 0.000101 ***
## BuildingResiCount         2.092e+02  3.621e+01   5.778 7.90e-09 ***
## RoadZoneEAST             -1.285e+05  2.217e+04  -5.795 7.13e-09 ***
## RoadZonemiddle            3.284e+04  2.361e+04   1.391 0.164252    
## RoadZonenorth            -9.828e+04  2.564e+04  -3.832 0.000128 ***
## RoadZoneriverNorth       -1.067e+05  2.596e+04  -4.108 4.03e-05 ***
## RoadZoneriverSouth        3.908e+03  2.087e+04   0.187 0.851444    
## RoadZonesEast            -1.130e+05  1.569e+04  -7.199 6.72e-13 ***
## RoadZonesouth             1.375e+05  1.937e+04   7.097 1.41e-12 ***
## RoadZonesRight           -1.757e+04  1.902e+04  -0.924 0.355574    
## RoadZonesWest             7.721e+04  1.945e+04   3.971 7.24e-05 ***
## PctBlack                  1.164e+02  2.724e+02   0.427 0.669153    
## MedInc                    1.604e+00  1.810e-01   8.858  < 2e-16 ***
## PayPerEmp                 2.716e+03  5.479e+02   4.958 7.30e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 295600 on 6700 degrees of freedom
## Multiple R-squared:  0.2435, Adjusted R-squared:  0.2415 
## F-statistic: 119.8 on 18 and 6700 DF,  p-value: < 2.2e-16
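# Predict sale prices for the low-income hold-out set.
# Note: this call uses modelFINAL, not regression_NOlow fitted above; to test
# the hold-out fit itself, pass regression_NOlow to predict() instead.
regPredLow<-predict(modelFINAL,dfEC_lowincome)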

regPredVal_low<-
  data.frame(observed=dfEC_lowincome$SalePrice.x,
             predicted=regPredLow)

# Signed error, absolute error, and absolute error as a share of the observed price
regPredVal_low<-
  regPredVal_low%>%
  mutate(error=predicted-observed,
         absError=abs(error),
         percentAbsError=absError/observed)

head(regPredVal_low)
##   observed predicted       error   absError percentAbsError
## 1    20000  66733.28   46733.284  46733.284      2.33666422
## 2   350000 217206.72 -132793.276 132793.276      0.37940936
## 3   312500 305058.87   -7441.129   7441.129      0.02381161
## 4   254738 352328.78   97590.785  97590.785      0.38310258
## 5   174000 186637.50   12637.498  12637.498      0.07262930
## 6   900000 136202.49 -763797.512 763797.512      0.84866390
# Mean absolute error (MAE)
mean(regPredVal_low$absError)
## [1] 143788.1
# Mean absolute percentage error (MAPE), expressed as a proportion
mean(regPredVal_low$percentAbsError)
## [1] 0.9590584
# Summary table of hold-out errors (values rounded from the results above)
`Hold out Neighborhood`<-data.frame(
  `Hold-out Neighborhood`=c('High-income','Medium-income','Low-income'),
  MAE=c(187940,111011,143788),
  MAPE=c(0.515,0.676,0.959),
  check.names=FALSE  # keep the hyphen and space in the first column name
)
Hold-out Neighborhood      MAE     MAPE
High-income             187940    0.515
Medium-income           111011    0.676
Low-income              143788    0.959
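
For reference, the two error measures reported for each hold-out can be written as follows, matching the `absError` and `percentAbsError` columns computed in the code. Here y_i is the observed sale price, ŷ_i the predicted price, and n the number of sales in the hold-out set. The symbols are our own notation, not taken from the course material.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,
\qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{y_i}
```

MAPE is left as a proportion rather than multiplied by 100, so a value of 0.959 corresponds to an average absolute error of about 96% of the observed price.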

Discussion

Based on the “spatial cross-validation” results from holding out and predicting for high-, medium- and low-income neighborhoods, the model does not appear to generalize across space. If it did, the MAPE values would be roughly equal across the three hold-outs. Instead, MAPE rises from 0.515 for high-income neighborhoods to 0.676 for medium-income and 0.959 for low-income neighborhoods. In relative terms, then, the model predicts more accurately for high- and medium-income neighborhoods than for low-income ones. The MAE ranks the groups differently (lowest for medium-income, highest for high-income), which likely reflects the higher sale prices in high-income neighborhoods.
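
The hold-out procedure discussed above can be sketched as a single loop. This is a minimal illustration on simulated data, not the code used for the report: the helper `holdout_errors()`, its argument names, and the `toy` data frame are all hypothetical. For each group it fits the model on the remaining observations, predicts only the held-out group, and reports MAE and MAPE.

```r
# Hypothetical helper: leave-one-group-out validation for a linear model.
# `formula`, `data`, `group` and `outcome` are illustrative argument names.
holdout_errors <- function(formula, data, group, outcome) {
  groups <- unique(data[[group]])
  rows <- lapply(groups, function(g) {
    train <- data[data[[group]] != g, ]
    test  <- data[data[[group]] == g, ]
    fit   <- lm(formula, data = train)      # fit WITHOUT the held-out group
    pred  <- predict(fit, newdata = test)   # predict ONLY the held-out group
    abs_err <- abs(pred - test[[outcome]])
    data.frame(HoldOut = g,
               MAE  = mean(abs_err),
               MAPE = mean(abs_err / test[[outcome]]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Simulated data standing in for the Nashville sales (not real values)
set.seed(1)
n <- 300
toy <- data.frame(
  baths  = sample(1:4, n, replace = TRUE),
  income = sample(c("High", "Medium", "Low"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$price <- 100000 + 50000 * toy$baths + rnorm(n, sd = 20000)

res <- holdout_errors(price ~ baths, toy, group = "income", outcome = "price")
print(res)
```

If the MAPE values in the resulting table are close to one another, that would be some evidence that the model generalizes across the groups. Large gaps, like those in our results, suggest it does not.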

# Scatterplots of observed vs. predicted prices, one panel per hold-out group
library(ggplot2)

head(regPredVal_high)
##   observed predicted     error  absError percentAbsError
## 1   537000  445343.9 -91656.08  91656.08       0.1706817
## 2   285000  365360.8  80360.85  80360.85       0.2819679
## 3   221000  601886.7 380886.70 380886.70       1.7234692
## 4   370000  297094.7 -72905.32  72905.32       0.1970414
## 5   376000  556341.4 180341.35 180341.35       0.4796313
## 6   250000  465061.1 215061.11 215061.11       0.8602444
regPredVal_high$HoldOut<-'High Income'
regPredVal_low$HoldOut<-'Low Income'
regPredVal_med$HoldOut<-'Medium Income'

names(regPredVal_high)
## [1] "observed"        "predicted"       "error"           "absError"       
## [5] "percentAbsError" "HoldOut"
names(regPredVal_low)
## [1] "observed"        "predicted"       "error"           "absError"       
## [5] "percentAbsError" "HoldOut"
names(regPredVal_med)
## [1] "observed"        "predicted"       "error"           "absError"       
## [5] "percentAbsError" "HoldOut"
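# Stack the three hold-out result sets, keeping only the columns needed for plotting
plotCols<-c('observed','predicted','HoldOut')
Combined<-rbind(regPredVal_high[,plotCols],
                regPredVal_low[,plotCols],
                regPredVal_med[,plotCols]
                )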

head(Combined)
##   observed predicted     HoldOut
## 1   537000  445343.9 High Income
## 2   285000  365360.8 High Income
## 3   221000  601886.7 High Income
## 4   370000  297094.7 High Income
## 5   376000  556341.4 High Income
## 6   250000  465061.1 High Income
ggplot(Combined,aes(x=predicted, y=observed))+
  geom_point(color='indianred')+
  geom_abline(slope = 1, size=0.5, col='dark grey', linetype='dashed')+
  ylab('Observed values')+
  xlab('Predicted values' )+
  facet_wrap(.~HoldOut, ncol=1)+
  theme_light()+
  theme(legend.position='none')

7. Concluding Remarks and Reflections on Class Discussion

Overall, the final model did not prove effective for predicting home prices in Nashville. On average, it explained about 38% of the variation in sale prices.

Despite the model's weak predictive power, we found that the individual predictors we chose were statistically significant. The most important predictor variables in the model were the room counts (bedrooms, bathrooms and other rooms). This suggests that a different combination of predictors built around these variables might yield better performance. Because of its weak predictive power, we would not recommend our model to Zillow. However, since the predictors we worked with were significant in determining home sale prices, we would recommend that Zillow consider incorporating them into its own model to improve the accuracy of its house price predictions.

Following the class discussion, we believe that leaving out the average house price of the area around each property contributed most to our model's poor performance. In retrospect, this feature is an obvious application of Tobler’s First Law: near things are more related than distant things. In this context, the prices of neighbouring properties are reasonably related to the price of each property. Incorporating this feature is, we think, the most direct way to account for the large variation in observed house prices across Nashville as a whole.
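
One way such a neighbourhood price feature could be built is to average the sale prices of each property's k nearest neighbours. The sketch below uses base R and a small made-up example. The helper `knn_mean_price()`, its arguments, and the toy coordinates and prices are hypothetical, and it assumes the coordinates are already in a projected system, so that Euclidean distance is meaningful.

```r
# Hypothetical helper: average sale price of the k nearest other sales.
# `x`, `y` are projected coordinates; `price` is the sale price.
knn_mean_price <- function(x, y, price, k = 5) {
  d <- as.matrix(dist(cbind(x, y)))   # pairwise Euclidean distances
  diag(d) <- Inf                      # a sale is not its own neighbour
  vapply(seq_along(price), function(i) {
    nearest <- order(d[i, ])[seq_len(k)]  # indices of the k closest sales
    mean(price[nearest])
  }, numeric(1))
}

# Toy example: two well-separated clusters of three sales each (made-up values)
x <- c(0, 1, 2, 100, 101, 102)
y <- c(0, 0, 0, 0, 0, 0)
price <- c(100, 110, 120, 500, 510, 520)

lag_price <- knn_mean_price(x, y, price, k = 2)
print(lag_price)
```

The full distance matrix is only practical for small samples; a dataset with thousands of sales would call for a dedicated nearest-neighbour search. In a predictive setting, the averages should also be computed from training sales only, so that held-out prices do not leak into the feature.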


  1. https://fox17.com/news/local/new-projections-show-tennessee-population-will-see-big-changes-by-2040

  2. Nashville Chamber Regional Stats

  3. Imputing Missing Data with R